Computational tools and resources for metabolism-related property predictions. I. Overview of publicly available (free and commercial) databases and software


Metabolism  has  been  identified  as  a  defining  factor  in  drug  development  success  or  failure  because  of  its  impact 
on  many  aspects  of  drug  pharmacology,  including  bioavailability,  half-life  and  toxicity.  In  this  article,  we  provide  an 
outline  and  descriptions  of  the  resources  for  metabolism-related  property  predictions  that  are  currently  either 
freely  or  commercially  available  to  the  public.  These  resources  include  databases  with  data  on,  and  software  for 
prediction  of,  several  end  points:  metabolite  formation,  sites  of  metabolic  transformation,  binding  to  metabolizing 
enzymes  and  metabolic  stability.  We  attempt  to  place  each  tool  in  historical  context  and  describe,  wherever 
possible,  the  data  it  was  based  on.  For  predictions  of  interactions  with  metabolizing  enzymes,  we  show  a  typical 
set  of  results  for  a  small  test  set  of  compounds.  Our  aim  is  to  give  a  clear  overview  of  the  areas  and  aspects  of 
metabolism  prediction  in  which  the  currently  available  resources  are  useful  and  accurate,  and  the  areas  in  which 
they  are  inadequate  or  missing  entirely. 


There is immense pressure on all developers of new drugs today to efficiently and cost-effectively produce compounds with efficacies better than existing therapies and with very limited adverse effects. This requires the simultaneous optimization of both the desired bioactivity and the absorption, distribution, metabolism, excretion and toxicity (ADMET) properties of a drug candidate. Of the ADMET processes, metabolism, and specifically metabolic stability, has been identified as a defining characteristic in drug development success or failure due to its overall impact on compound pharmacokinetics. Therefore, metabolic information about compounds in the drug-discovery pipeline is crucial to their development as drugs. For example, extensive first-pass metabolism can contribute to low bioavailability, while metabolism that occurs too rapidly can cause a short therapeutic window requiring a frequent dosing schedule. Conversely, metabolism that proceeds too slowly can cause an accumulation of drug in the body and increase the risk of toxic effects. Inhibition or induction of the CYP450 enzymes, which catalyze the majority of metabolic reactions [1], can cause adverse drug interactions. In some cases, the metabolites of a compound can be toxic or reactive, or can themselves exhibit bioactivities that may differ from those of their parent molecule [2].


The importance of a full understanding of metabolite formation and interactions with CYPs became clear in the 1990s when the antihistamine terfenadine (marketed as Seldane in the USA) was implicated in life-threatening cardiotoxic drug-drug interactions with ketoconazole, an antifungal drug [3]. Terfenadine is a prodrug that ordinarily is rapidly metabolized by CYP 3A4 into its active carboxylated metabolite. However, in the presence of ketoconazole or other drugs, such as macrolide antibiotics, that inhibit CYP 3A4, the concentration of the parent compound can rise to toxic levels [4]. Unfortunately for Hoechst Marion Roussel (now Sanofi-Aventis), the makers of terfenadine, the active carboxylated metabolite fexofenadine had meanwhile been patented by another company, and Hoechst Marion Roussel was forced to buy back the development rights in order to market it (as Allegra® in the USA) [5]. This incident led to a new appreciation of hERG inhibition as a mechanism for drug toxicity [6,7] and new requirements from the US FDA for a characterization of the metabolites of any new drug candidate along with in vitro measurements of CYP inhibition and induction [4].

The 1990s were also the heyday of hope and hype for the new drug-discovery technologies of combinatorial chemistry and high-throughput screening, and the development and testing of large screening libraries. It became clear that in order to realize the potential of the large numbers of hits coming out of high-throughput screens, the determination of ADMET properties would also have to be carried out more rapidly and efficiently than the in vivo screens that were standard for the time [8,9]. This led to the development of new, faster in vitro metabolism screening methods [10] as well as even more rapid and inexpensive in silico models [11].

Although in vitro screening methods can be fairly accurate for determining metabolic properties, they have several limitations, including the cycle time and the inefficiency when screening large numbers of compounds [12], not to mention the significant cost when applied to large screening libraries. In silico screening can help to alleviate the strain of large numbers of samples on the in vitro methods by characterizing, categorizing and, thus, potentially eliminating metabolically unstable compounds from the collection of candidates early on in the drug-discovery process [13].

Here, we describe databases (Table 1) and software (Table 2) that can be obtained and used by the general (scientific) public, namely, resources that are either marketed as commercial products, or that are freely or even openly available (Box 1). Dozens, if not hundreds, of metabolism prediction methods can be found in the literature, and reviews thereof are published quite frequently [13-15]. Many of the QSAR models described in the literature have been developed by pharmaceutical companies, on large proprietary datasets, using proprietary descriptors and/or 'black-box' machine learning methods. Such studies may be an illustration of what is possible given enough good and consistent data. However, for the drug developer at a different company or institution desiring a simple off-the-shelf method to use in an ongoing fast-paced research project, most of these studies are not immediately useful.

This article focuses on a bottom-up, individual compound-driven approach to metabolism predictions, as opposed to the top-down systems biology approach, which is currently called 'metabolomics.' We are, therefore, not generally including here tools that are used in metabolomics research, such as those for metabolite/biomarker identification and metabolic profiling studies. We are also not including tools for what could be called environmental metabolism predictions, that is, biodegradation or biocatalysis by bacteria. Many of the commercial software programs listed here provide other functionality and can calculate ADMET properties beyond those described here. We do not aim to provide a complete description of each package, only the portions of it relevant to metabolism predictions.

The  metabolism  end  points  that  are  currently 
predicted  by  in  silico  methods  can  be  divided 
into  two  general  types.  The  first  type  consists  of 
predictions  of  the  effects  of  metabolism  on  the 
structure  of  a  compound  itself,  such  as  a  list  of 
its  metabolites,  the  reactions  that  the  compound 
will  undergo  (biotransformations),  or  the  atomic 
sites  where  metabolic  reactions  are  most  likely  to 
take  place  (regiolability  or  regioselectivity).  The 


Table 1. Publicly available databases and datasets with metabolism-related content.

| Name | Company/institution | Availability | Compound source | Properties | Ref. |
|---|---|---|---|---|---|
| ADME DB | Fujitsu | Commercial | Literature | Metabolizing enzymes, reaction data | [217,218] |
| AurSCOPE ADME | Aureus Sciences | Commercial | Literature, patents, drugs | Metabolizing enzymes, reaction data, metabolic stability | [260] |
| BioPrint | Cerep | Commercial | Drugs | Measured bioactivities | [89,257] |
| ChEMBL | European Bioinformatics Institute | Free, open | Literature | Bioactivities | [93,264] |
| DrugBank | University of Alberta | Free, open | Drugs | Pharmacology, metabolizing enzymes | [91,259] |
| MetaBase | GeneGo | Commercial | Literature | CYP substrates | [29,203] |
| Metabolite | Accelrys | Commercial | Literature | Biotransformations | [201] |
| Microsomal Stability | Evolvus | Commercial | Literature | Clearance, half-life | [262] |
| PubChem | NCBI, NIH | Free | Various | Measured bioactivities | [66,233] |
| QSAR World | Strand Life Sciences | Free | Literature | Clearance, half-life | [263] |
| WOMBAT-PK | Sunset Molecular Discovery | Commercial | Literature, drugs | Pharmacokinetic data | [90,258] |

Table 2. Publicly available software for metabolism-related predictions.

| Name | Company/institution | Availability | Prediction | Method | Ref. |
|---|---|---|---|---|---|
| ACD/Percepta | ACD/Labs | Commercial | Metabolic reaction sites, CYP substrates and inhibitors | QSAR models | [50,51,224] |
| ADMET Descriptors/Collection | Accelrys | Commercial | CYP 2D6 inhibition | QSAR model | [71,251,252] |
| ADMET Predictor | Simulations Plus | Commercial | Metabolic reaction sites, CYP substrates, inhibitors, kinetics, UGT substrates, clearance | QSAR models | [211] |
| ADMEworks Predictor | Fujitsu | Commercial | CYP-binding affinities | QSAR models | [253] |
| CypScore | CAChe Research | Free | Metabolic reaction sites | QSAR models | [43,219] |
| isoCYP | Molecular Networks | Commercial | CYP substrates | QSAR models | [75,254] |
| MetabolExpert | CompuDrug | Commercial | Biotransformations | Logical rules | [30,205] |
| Metabolizer | ChemAxon | Beta 'preview' | Biotransformations | Reaction enumeration | [216] |
| MetaDrug | GeneGo | Commercial | Biotransformations, CYP substrates and inhibitors | QSAR models | [29,204] |
| META-PC | MultiCASE | Commercial | Biotransformations | Fragmentation, dictionary rule application | [32,34,207] |
| MetaPrint2D | University of Cambridge (UK) | Open source | Metabolic reaction sites, biotransformations | Fingerprint counting | [53,54,228] |
| MetaSite | Molecular Discovery | Commercial | Metabolic reaction sites, metabolites | Structural alignment plus semi-empirical calculations | [39,214] |
| METEOR | Lhasa | Membership | Biotransformations | Rule application plus logical reasoning | [35,36,208] |
| MEXAlert | CompuDrug | Commercial | First-pass conjugation | Rule application | [267] |
| P450 SOM | Schrödinger | Commercial | Metabolic reaction sites | Docking plus rules | [56,231] |
| PASS | GeneXplain | Commercial | CYP substrates, inhibitors, inducers, phase II substrates | QSAR models | [69,247] |
| QikProp | Schrödinger | Commercial | Number of biotransformations | SMARTS matching | [268] |
| RS-Predictor | Rensselaer | Free | Metabolic reaction sites | QSAR models | [46,48,223] |
| SMARTCyp | University of Copenhagen (Denmark) | Free | Metabolic reaction sites | Ab initio calculations (fragment look-up) plus structural measurements | [44,45,222] |
| StarDrop | Optibrium | Commercial | Regiolability, CYP binding affinity | Semi-empirical calculations, QSAR models | [209] |
| TIMES | Burgas 'Prof. Assen Zlatarov' University (Bulgaria) | Unknown | Biotransformations | Rule application, QSAR models | [40,41,215] |
| VirtualToxLab | Biograf 3R/University of Basel (Switzerland) | Commercial | CYP and nuclear receptor binding affinities | Docking plus QSAR models | [78,255] |


second type of prediction is concerned with which enzymes the compound will interact with, and includes the classification of CYP substrates, inhibitors and inducers, as well as predictions of whether the compound will also (or instead) be metabolized by non-CYP enzymes such as glucuronosyltransferases and sulfotransferases. A third type of prediction, which we discuss in more detail in [16], involves estimation of a compound's overall metabolic stability, that is, of its half-life or clearance rate, measured either in vivo or in an in vitro model.


Effects of metabolism on compound structures

The prediction of metabolites was the first computational end point for which software and databases became available, as early as the 1980s, due to the advent of personal computers and the publication of several textbooks that began to organize and categorize knowledge about metabolic reactions [1,17,18]. These methods generally consist of: a set of rules, extracted either directly from the literature or indirectly from the literature via a database; a method for applying the
Box 1. Definitions of the various levels of accessibility that can be provided for software and databases.

We provide definitions, as used in this paper, of the various levels of accessibility that can be provided for software and databases. Note that these are not necessarily the same as the definitions used by the Open Source Initiative or the GNU Project.

■ Academic: resource developed by an academic or nonprofit group that is available for free to other nonprofit researchers, and in exchange for money to anyone else.
■ Commercial products: resource that is available to anyone in exchange for money.
■ Freely available: resource that is available to anyone and costs no money.
■ Openly available: database or software program for which the underlying data or source code is available to anyone for examination, modification, or incorporation into other resources as desired.
■ Proprietary: private resource that is only available to certain people (i.e., its developers or the employees of a company).
■ Publicly available: resource that can be accessed or used by anyone, regardless of affiliation, either for free or in exchange for money.

rules to an input compound; and a method for deciding when to stop applying them, that is, a way of ranking or ordering the combinatorial explosion of predicted metabolites. Many programs can also provide, either directly or via another program, an estimation of the toxicity for each predicted metabolite.
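
The three components named above (a rule set, a way of applying the rules, and a stopping criterion) can be illustrated with a minimal Python sketch. It is not the algorithm of any program reviewed here: the rules are hypothetical, molecules are written in a simplified text notation rather than as chemical structures, rule application is plain string rewriting, and the stopping criterion is a fixed number of generations.

```python
from collections import deque

# Hypothetical toy rules (name, pattern, replacement) acting on a simplified
# text notation. They are illustrative only, not any program's rule base.
RULES = [
    ("O-demethylation", "OCH3", "OH"),
    ("aliphatic hydroxylation", "CH3", "CH2OH"),
    ("alcohol oxidation", "CH2OH", "COOH"),
]


def apply_rule(molecule, pattern, replacement):
    """Return every product obtained by rewriting one occurrence of pattern."""
    products = []
    start = molecule.find(pattern)
    while start != -1:
        end = start + len(pattern)
        products.append(molecule[:start] + replacement + molecule[end:])
        start = molecule.find(pattern, start + 1)
    return products


def predict_metabolites(parent, rules=RULES, max_generations=2):
    """Apply the rules breadth-first and stop after max_generations."""
    seen = {parent}
    tree = []  # entries: (generation, substrate, rule name, product)
    queue = deque([(parent, 0)])
    while queue:
        substrate, generation = queue.popleft()
        if generation == max_generations:
            continue  # stopping criterion: do not expand this branch further
        for name, pattern, replacement in rules:
            for product in apply_rule(substrate, pattern, replacement):
                if product in seen:
                    continue  # skip metabolites that were already generated
                seen.add(product)
                tree.append((generation + 1, substrate, name, product))
                queue.append((product, generation + 1))
    return tree
```

Calling `predict_metabolites("PhCH3")` yields a first-generation product `PhCH2OH` and a second-generation product `PhCOOH`. Real programs replace the generation limit with more informative criteria, such as rule priorities, occurrence probabilities or a lipophilicity cutoff, as described for the individual programs below.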

Regioselectivity or metabolic reaction site predictions are similar in spirit to the prediction of biotransformations, because each metabolic reaction must occur at a site, though not every site can undergo a reaction. Some software programs can predict both reactions and sites, and the underlying data for building models are generally the same. Site prediction methods fall into two categories: those that use a calculation (or fragment-based look-up) of the quantum chemical reactivity of each atom in the query molecule, usually along with a steric accessibility factor, and those that rely on some form of pattern matching to datasets of experimentally observed metabolic reaction sites. More detailed academic and proprietary/private methods have been reported in the literature [15,19], involving high-level quantum calculations, docking, homology modeling of CYP structures and molecular dynamics, but few of these have been implemented as publicly available software.

For further details on some of the programs discussed here, Kulkarni et al. [20] give an in-depth review and comparison of the programs MetabolExpert, META, Meteor and TIMES. Computational methods for predicting sites of metabolism were reviewed recently and comprehensively by a group at Gedeon Richter in Hungary [14], and specific comparisons of a few commercial programs were done by groups in Belgium at Ghent University and Janssen Pharmaceutica [21], and at Genentech [22].


■  Databases 
Metabolite 

The oldest and largest commercial database containing biotransformation reactions, including parent compounds and metabolites, is Metabolite [201], which was developed by MDL Information Systems in 1994 [23]. MDL was purchased by Elsevier in 1997, then sold to Symyx in 2007. The Metabolite database is now available commercially through Accelrys, which merged with Symyx in 2010. This database should not be confused with the Accelrys Metabolism database [202], originally called Biotransformations and first developed by Synopsys, which merged into Accelrys in 2001. Metabolism was based on data extracted from two publications by the Royal Society of Chemistry: Biotransformations [24] and Metabolic Pathways of Agrochemicals [25]. Unfortunately, it is no longer commercially available.

Metabolite was originally based on data from a German book series, Biotransformation von Arzneimitteln [26], and the journal Pharmacokinetics [27]. It was intended to be a wide collection of metabolism data with a focus on quantity rather than quality [23] and now includes compounds from New Drug Applications, proceedings from meetings of the International Society for the Study of Xenobiotics and the scientific literature from 1990 onward [11]. These are mainly pharmaceutical compounds, but also include food additives, industrial chemicals and agrochemicals [28]. The latest version at the time of writing, 2011.2, contains 62,465 molecules and 103,907 biotransformation reactions, of which 36,041 are in humans. The database can be searched by structure, substructure or similarity to any or all of the parent, substrate (pathway intermediate) or metabolite compounds, or by a reaction query to the biotransformations. For some reactions in the database, the enzyme responsible for catalysis is listed, along with information about the reaction time and whether the compound can act as an inhibitor or an inducer as well as a substrate [201]. Data from both the Metabolite and Metabolism databases have been used extensively in the development of many software programs for predicting both metabolites and sites of metabolism, as will be shown below.

MetaBase 

Other commercially available databases of biotransformations include MetaBase, from GeneGo, which is a large database that includes literature data on small molecule-protein, protein-protein and protein-DNA interactions, signaling pathways, regulatory networks and diseases. As of May 2012, it contains 692,425 chemical compounds, 44,171 proteins and 1,076,985 interactions. Relevant to metabolism predictions, it also contains a set of 9048 metabolic reactions on xenobiotic and endogenous compounds [203]. The program MetaDrug, first released in 2004, is a software platform and graphical interface to the MetaBase database [204]. In MetaDrug, a small set of 81 'metabolic rules' covering phase I and II enzymes can be applied to input query compounds, and the predicted metabolites are prioritized according to their occurrence frequencies in MetaBase [29]. A set of QSAR models for various ADMET predictions can then be applied to the set of input compounds and their predicted metabolites, as will be discussed in the next section.
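
Frequency-based prioritization of this kind can be sketched in a few lines of Python. The sketch assumes that each predicted metabolite is tagged with the rule that produced it and that the reference database can be reduced to a list of reaction-type labels; both are simplifications made for illustration, and the function name and data layout are hypothetical rather than taken from MetaDrug.

```python
from collections import Counter


def prioritize_by_frequency(predicted, database_reactions):
    """Order predicted metabolites by the relative frequency with which their
    generating rule occurs among the reactions in a reference database."""
    counts = Counter(database_reactions)
    total = sum(counts.values())
    scored = [
        (metabolite, rule, counts[rule] / total if total else 0.0)
        for metabolite, rule in predicted
    ]
    # Most frequently observed reaction types come first.
    return sorted(scored, key=lambda item: item[2], reverse=True)
```

A rule that never occurs in the reference data receives a frequency of zero and therefore sinks to the bottom of the list, which is one simple way of keeping the combinatorial output of a rule set manageable.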

■  Software 
MetabolExpert 

The earliest computer program for metabolite prediction was MetabolExpert [30], first released in 1987 by the company CompuDrug [205], originally founded in Hungary. It was based on the metabolic pathways described in a textbook [17], which had been organized according to functional groups and substructures rather than chemical series. This allowed, for the first time, the possibility of deriving the biotransformations that were likely to occur on an entirely new compound [16], and thus, the formalizing of a set of logical rules for metabolism in an expert system. In MetabolExpert, the rules are programmed as logical statements in Prolog and are applied successively to the input molecule to generate a tree of predicted metabolites [30]. The set of rules was later augmented with data from other textbooks and the literature [206].

META 

Another early program was META, part of the Computer Automated Structure Evaluation (CASE) system developed by Gilles Klopman at Case Western Reserve University (USA) in the mid-1980s [31], and now licensed by MultiCASE, Inc. [207] (META was originally written for VMS, in 1994, hence the name META-PC for the Windows version). META consists of a program for breaking down an input query molecule into fragments, then applying a dictionary of transformations to generate metabolites. There are four dictionaries available: mammalian metabolism, aerobic and anaerobic degradation by bacteria (for environmental toxicity predictions), and photodegradation. The program recognizes when an intermediate is chemically unstable and applies further transformations from a fifth 'spontaneous' dictionary until a stable chemical compound is achieved [32]. The mammalian metabolism dictionary for META [33] was built to be comprehensive, and to include only 'well-established' data from a set of textbooks dating from the mid-1970s to the early 1990s. A genetic algorithm was incorporated into the program to prioritize the transformations in the dictionary so as to be able to reproduce experimental data on observed metabolites [34]. The output from META is a tree of predicted metabolites, pruned at the point where the logP of the compounds is low enough that they can presumably be excreted by the kidneys [32].

Meteor 

Meteor was developed beginning in 1997 by Lhasa, Ltd [208] as a logical outgrowth of their Derek system for toxicity prediction, as a means of dealing with chemicals that are nontoxic in their original form but run the risk of being converted into toxic metabolites in the body [35]. Meteor consists of a knowledge base of biotransformation reactions, collected from the Accelrys Metabolism database [202], from the literature, and via consultation with human experts from industry, regulatory agencies, and academia [36]. The biotransformation reactions are applied to input compounds and their logP is calculated. The resulting tree of metabolites is subjected to an absolute reasoning process to evaluate the qualitative probabilities for each reaction (probable, plausible, equivocal, doubted and improbable) as a function of the substrate's lipophilicity, and a relative reasoning process to rank order the predicted metabolites according to regioselectivity data in the knowledge base [36].

StarDrop 

StarDrop has the distinction of being simultaneously one of the oldest and one of the newest software programs for metabolic site prediction. It began life in the Camitro Corporation, which was founded in 1998 at the height of the dot-com boom, and offered its suite of ADME prediction models over a secure internet connection [11]. Camitro merged with ArQule in 2001, and in 2003, ArQule sold its ADME capabilities to Inpharmatica, who named the ADME program Admensa Interactive. Inpharmatica was bought by BioFocus in 2006, and initially kept the name Admensa, but in 2008 the program was renamed StarDrop. Finally, the company Optibrium was spun off from BioFocus in 2009, for the sole purpose of developing and marketing the StarDrop software [209].

StarDrop can predict reaction sites on small-molecule substrates for CYPs 3A4, 2D6 and 2C9. For all these enzymes, a semi-empirical calculation of hydrogen removal energies at the AM1 level is used [37,38]. This is supplemented by additional empirical steric and orientation parameters, which are different for each CYP. StarDrop no longer runs over the internet, but as a desktop client, which can be coupled to a server for running the computationally intensive semi-empirical quantum mechanical calculations. Because of these calculations, StarDrop predictions for the site of metabolism take on the order of a few minutes per molecule, which is significantly slower than methods involving only a QSAR model or a database lookup. The output from StarDrop is a listing of predicted sites of metabolism for the query molecule, ranked in order of the predicted relative proportion of metabolites formed at each site. Additionally, for CYP 3A4, a 'composite site lability' is calculated for the molecule as a whole, as an estimate of the efficiency of metabolism for the entire molecule. This number can be compared across different molecules in a data set [210].

ADMET  Predictor 

ADMET Predictor, formerly known as QMPRPlus, was first released in 1999 by Simulations Plus [211]. The Metabolism Module can predict sites of metabolism for nine CYPs (1A2, 2A6, 2B6, 2C8, 2C9, 2C19, 2D6, 2E1 and 3A4). The models for CYP site prediction were derived from data extracted from the Accelrys Metabolite database [201], supplemented with additional literature data. ADMET Predictor uses QSAR models built with molecular structure descriptors (e.g., molecular weight, numbers of various functional groups, geometric and electrostatic properties) and trained using artificial neural network classification ensembles [212]. In contrast to other metabolic site prediction programs, ADMET Predictor first predicts whether or not the query molecule is in fact a substrate for the CYP, and then calculates a propensity score for each atomic site [213].
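
The two-stage logic described above, a substrate classification followed by per-atom scoring, can be expressed as a short Python sketch. The substrate probability and the atomic propensity scores are taken as given inputs here, since the underlying neural network models are not public; the function name, the 0.5 threshold and the data layout are assumptions made purely for illustration.

```python
def predict_sites(substrate_probability, atom_propensities, substrate_threshold=0.5):
    """Two-stage site prediction: gate on substrate status, then rank atoms.

    substrate_probability: model output in [0, 1] for 'is a substrate of this CYP'
    atom_propensities: mapping of atom label to propensity score
    """
    # Stage 1: molecules not classified as substrates get no site predictions.
    if substrate_probability < substrate_threshold:
        return []
    # Stage 2: rank atomic sites from highest to lowest propensity.
    return sorted(atom_propensities.items(), key=lambda item: item[1], reverse=True)
```

The gate in the first stage avoids reporting a 'most likely' site for a compound that the enzyme is not expected to metabolize at all, which a purely per-atom ranking cannot express.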

ADMET Predictor can be run in interactive mode (with a spreadsheet-like user interface) or batch mode. In either case the predictions are very fast and use limited computing resources. The interactive mode can easily handle datasets of tens of thousands of compounds on a Windows-based PC with a few GB of memory. For larger datasets, one may run into memory issues with the spreadsheet interface in the interactive mode.

MetaSite 

MetaSite from Molecular Discovery, which was first released in 2003, is perhaps the most well-known program for regioselectivity predictions [214]. It is based on the alignment of 3D interaction points calculated for a substrate binding site with similar interaction points calculated for input query compounds. An alignment of the two sets of interaction points is optimized to predict the orientation of the query compound in the CYP active site. This allows the calculation of an accessibility score for each atom in the query, based on its distance to the heme in the CYP active site. A reactivity score is also calculated for each atom, based on semi-empirical calculations of the energy for hydrogen removal. The probability of metabolism at each site in the input compound is then given as the product of its accessibility and reactivity scores [39]. MetaSite can also calculate the contributions to reactivity by atoms adjacent to the site, which is useful for suggesting synthetic modifications to a compound's structure in cases where the site of metabolism itself cannot be modified for whatever reason. MetaSite contains built-in models for the CYP 1A1, 1A2, 2B6, 2C9, 2C19, 2D6, 2E1, 3A4 and 3A5 substrate binding sites, and additional models for mutants or other isoforms can be built and added by the user. The newest functionality in MetaSite is the ability to generate structures for the metabolites produced by reactions at each site [214].
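
The combination of accessibility and reactivity into a single site score can be illustrated with a small Python sketch. The per-atom input scores are assumed to come from upstream calculations that are not reproduced here, and the normalization of the products so that they sum to one is an added assumption for readability, not a documented feature of MetaSite.

```python
def rank_sites(accessibility, reactivity):
    """Score each atom as accessibility * reactivity and rank, highest first.

    Both arguments map atom labels to non-negative scores. Atoms missing
    from either mapping are ignored.
    """
    products = {
        atom: accessibility[atom] * reactivity[atom]
        for atom in accessibility
        if atom in reactivity
    }
    total = sum(products.values())
    # Normalizing to relative likelihoods is an illustrative assumption.
    relative = {
        atom: (value / total if total else 0.0) for atom, value in products.items()
    }
    return sorted(relative.items(), key=lambda item: item[1], reverse=True)
```

Because the score is a product, an atom must be both reachable by the heme and intrinsically reactive to rank highly: a very reactive hydrogen that is held far from the heme scores poorly, as does an exposed but unreactive one.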

TIMES 

Like Meteor, Tissue Metabolism Simulator (TIMES) is focused on toxicology predictions. TIMES was developed in 2004 at the Laboratory of Mathematical Chemistry at Burgas 'Prof. Assen Zlatarov' University in Bulgaria [215]. Currently, it contains a liver (S9) metabolism simulator [40] and a skin metabolism simulator [41], both of which are based on biotransformations collected from literature data. Generation of the metabolic maps produced from applying the biotransformation reactions to the input query molecule is stopped when a threshold for either metabolite occurrence probability or logP is reached [40]. The set of metabolites can then be filtered through one of several QSAR models to rank them according to their predicted toxicity [215].
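
A threshold-based stopping criterion of this kind can be sketched in Python as follows. The sketch is a toy model, not the TIMES algorithm: every transformation is assumed to apply to every compound, each carries a hypothetical step probability and a hypothetical logP increment, and the threshold values are arbitrary.

```python
def build_metabolic_map(parent_logp, transformations, min_probability=0.1, min_logp=1.0):
    """Generate a metabolic map, stopping each branch at a threshold.

    transformations: list of (label, step_probability, delta_logp) tuples,
    all hypothetical. A metabolite is dropped if its cumulative probability
    falls below min_probability, and it is not transformed further once its
    logP is no longer above min_logp.
    """
    results = []
    stack = [("parent", 1.0, parent_logp)]
    while stack:
        name, probability, logp = stack.pop()
        for label, step_probability, delta_logp in transformations:
            child_probability = probability * step_probability
            child_logp = logp + delta_logp
            if child_probability < min_probability:
                continue  # too unlikely to be reported
            child = (name + " > " + label, child_probability, child_logp)
            results.append(child)
            if child_logp > min_logp:
                stack.append(child)  # still lipophilic: keep transforming
    return results
```

With a parent logP of 3.0 and two transformations, a hydroxylation `(0.5, -1.0)` and a glucuronidation `(0.3, -2.5)`, the map contains four metabolites: generation halts on the glucuronide branches because their logP has dropped below the threshold, and on the doubly hydroxylated branch because its logP has reached it.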

Metabolizer 

Metabolizer  is  a  tool  for  the  ChemAxon  platform. 
It  was  introduced  in  2007,  and  it  is  currently  only 
available  as  a  ‘preview’,  pending  the  development 
of  a  full  set  of  biotransformation  libraries.  To  run 
Metabolizer,  a  library  of  reactions  is  loaded  along 
with  the  set  of  substrates,  and  a  set  of  metabolites 
is  enumerated  by  applying  the  reactions  to  the 
substrates  [216]. 

Merck  dataset 

In 2007 researchers at Merck and Co. published a set of QSAR models for predicting regioselectivity for CYPs 3A4, 2D6 and 2C9 [42]. This, in itself, is nothing extraordinary, and their methods are not available as public software; however, rather than simply providing a list of references or referring vaguely to 'literature data' they published the structures of the compounds used in their training sets, which were extracted from the Accelrys Metabolite [1] and Fujitsu ADME DB [217,218] databases, in computer-readable mol2 format, as supplementary information that could be downloaded. This dataset of 521 compounds was used, as either a training or test set, to develop no fewer than three new freely available site prediction methods: CypScore, SMARTCyp and RegioSelectivity Predictor (RS-Predictor).

CypScore 

CypScore is an implementation by CAChe
Research [219] of a method described by
researchers at Bayer and the University of
Erlangen-Nuremberg (Germany) in 2009 [43].
It is a hybrid of a metabolic site prediction
method and a method for estimating clearance
or microsomal stability, in that it predicts the
regiolability of query molecules toward a generalized
‘super CYP’; in other words, it identifies the
sites on the input query molecules that are
most generally susceptible to oxidation. Unlike
some other site prediction methods, it can be
used to rank molecules relative to one another.
CypScore provides a set of six multiple linear
regression models for various oxidation reaction
types. Models were trained using a dataset
curated from the literature and from the
Biotransformations [24] publication series that
formed the basis for the Accelrys Metabolism
database [202]. The descriptors in the models are
features of the molecular electrostatic surface
and of the semi-empirically calculated (AM1)
wave function [43]. The compounds published
by Merck [42] were used as a test set. CypScore
was validated using a proprietary dataset at
Bayer (though trained on public data), but the
results were reported in such a way that the
models could easily be reproduced by other
researchers.

CypScore is freely available as a customized
plug-in for the modeling program MOE from
Chemical Computing Group [220] and as a component
for Pipeline Pilot from Accelrys [221].
However, to perform the calculations it requires
other CAChe Research software that is available
under academic licensing terms [219].

SMARTCyp 

SMARTCyp, released in 2010, is a simple and
elegant method for predicting sites that are
metabolized by CYP 3A4 and 2D6, relying on
only two or three molecular descriptors. The
first of these is a reactivity descriptor, which is
a measure of the activation energy of the oxidation
reaction at each site in the query molecule,
looked up in a table of fragment energies precalculated
using density functional theory. The
second descriptor is a measure of how far each
site is from the center of the molecule [44]. The
CYP 2D6 model adds a third descriptor as a
measure of the distance to the nearest protonated
nitrogen atom [45]. The aforementioned compounds
from Merck [42] were used as a test set
to validate the models [44].
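
A two-descriptor ranking of this kind can be sketched as follows. This is an illustration under stated assumptions rather than the SMARTCyp code: the fragment names, the energies, the weight and the exact definition of the distance-from-center descriptor (here, each atom's longest topological distance scaled by the largest such value in the molecule) are all placeholders, and the third CYP 2D6 descriptor is omitted.

```python
from collections import deque

# Toy four-atom chain 0-1-2-3 with hypothetical fragment assignments and
# hypothetical activation energies (arbitrary units, illustration only).
ADJACENCY = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
FRAGMENT_OF = {0: "primary_C", 1: "secondary_C", 3: "N_methyl"}
ENERGY_TABLE = {"primary_C": 90.0, "secondary_C": 80.0, "N_methyl": 60.0}


def topological_distances(adjacency, start):
    """Bond counts from start to every reachable atom (breadth-first search)."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        atom = queue.popleft()
        for neighbor in adjacency[atom]:
            if neighbor not in dist:
                dist[neighbor] = dist[atom] + 1
                queue.append(neighbor)
    return dist


def accessibility(adjacency):
    """Longest topological distance from each atom, scaled by the largest
    such value, so atoms far from the molecular center score close to 1."""
    ecc = {a: max(topological_distances(adjacency, a).values()) for a in adjacency}
    longest = max(ecc.values()) or 1  # guard against a single-atom input
    return {a: e / longest for a, e in ecc.items()}


def rank_sites(adjacency, fragment_of, energy_table, weight=8.0):
    """Return candidate atoms ordered from most to least likely site.
    Score = looked-up energy - weight * accessibility (lower is better)."""
    acc = accessibility(adjacency)
    scores = {a: energy_table[f] - weight * acc[a] for a, f in fragment_of.items()}
    return sorted(scores, key=scores.get)


if __name__ == "__main__":
    print(rank_sites(ADJACENCY, FRAGMENT_OF, ENERGY_TABLE))
```

Because the energy is a table lookup and the second descriptor needs only the 2D bond graph, no 3D structure or quantum chemical calculation is needed at prediction time.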

SMARTCyp is freely and openly available as
a web service or a downloadable Java program
from the University of Copenhagen (Denmark)
[222]. Since SMARTCyp uses 2D ligand structure
information only, it is very fast, and both
the web service and the Java executable of the
program are user-friendly. The open source code
for SMARTCyp has allowed its methodology
to be used and expanded by other researchers:
SMARTCyp predicted reactivities are incorporated
as descriptors into the newest version
of RS-Predictor, described below [46], and an
openly available extension of the SMARTCyp
program to cover three more CYPs, namely 1A2,
2C9 and 2C19, was recently implemented [47].

RS-Predictor 

RS-Predictor was first published in 2011 [48]. It
uses a large set of substructure-based, physicochemical
and quantum chemical (semi-empirical
AM1) descriptors, and a machine learning
method called multiple instance ranking, which
is a variation of support vector machines. The
models were calibrated on a dataset consisting
of the Merck compounds [42], curated and
augmented by compounds from a recent review
paper [49]. RS-Predictor was originally developed
to predict metabolism sites for CYP 3A4, but has
now been extended to 1A2, 2A6, 2B6, 2C19,
2C8, 2C9, 2D6, 2E1 and a theoretical ‘merged’
CYP enzyme [46]. The training sets, annotated
with site predictions from RS-Predictor,
SMARTCyp, StarDrop and Schrödinger P450
Site of Metabolism, are available for download
in SD format as supplementary information to
the journal articles [46,48].

RS-Predictor is freely available as a web service
from the Rensselaer Exploratory Center for
Cheminformatics Research [223]. It is simple to
use and requires only the upload of an SD file
containing the query structures. The output
consists of an SD file in return, annotated with
the predicted primary, secondary and tertiary
sites of metabolism for each compound in each
CYP model.

ACD/Percepta 

ACD/Percepta [224], formerly known as ADME
Suite, was created in 2009 with the merger
of Pharma Algorithms with ACD/Labs.
Initially, the only available metabolism-related
model, which had been developed by Pharma
Algorithms, was for CYP 3A4 binding and sites
of metabolism [225]. The regiolability models have
since been expanded to include CYP 1A2, 2C19,
2C9 and 2D6, as well as a general model for
human liver microsomes [226]. They are trained
on literature data using fragment-based descriptors
and a partial least squares-based modeling
method that accounts for similarity between the
query compound and compounds in the training
set as well as the consistency of the experimental
data for the training set compounds. This allows
a reliability measure to be reported along with
each site prediction [50,51]. For each prediction,
ACD/Percepta also displays the five most similar
compounds from the training set [226,227].

MetaPrint2D 

MetaPrint2D is a new open-source tool, developed
in 2010 at the University of Cambridge
(UK), that is freely available as a web service
[228] and as a standalone Java program [229], and
is also distributed as part of the Bioclipse chemo-
and bio-informatics workbench [52,230]. It uses
circular fingerprints to describe the environment
around each atom according to the atom types
found bonded to it [53]. The Accelrys Metabolite
database [201] was mined to extract counts of how
often biotransformations occur (or do not occur)
at specific atom environments. The ratio of how
often an atom environment is a reaction center to
how often it is not gives an estimate of the likelihood
of a biotransformation reaction occurring at an
atom environment of that type in a new query
compound [54]. The basic method for counting
metabolite occurrence frequencies is the same as
that used by MetaDrug to prioritize predicted
metabolites [55]. The output from MetaPrint2D is
a 2D structural drawing of the query compound
with its atoms colored to indicate the likelihood
of a biotransformation occurring at that site. In
a recent extension called MetaPrint2D-React,
the reactions that may occur at each atomic site
and the structures of predicted metabolites can
also be generated [228].
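
The counting idea can be sketched in a few lines. This is a simplified illustration, not the MetaPrint2D code: atom environments are represented by made-up string labels instead of circular fingerprints, the counts are invented, and the ratio is expressed as the fraction of all occurrences that were reaction centers, which is one possible normalization and an assumption here.

```python
from collections import Counter

# Hypothetical mined observations: (atom environment label, was it a reaction center?).
OBSERVATIONS = (
    [("N-CH3", True)] * 3 + [("N-CH3", False)] * 1
    + [("aryl-CH", True)] * 1 + [("aryl-CH", False)] * 9
)


def occurrence_ratios(observations):
    """Fraction of occurrences of each environment that were reaction centers."""
    seen, reacted = Counter(), Counter()
    for environment, is_center in observations:
        seen[environment] += 1
        if is_center:
            reacted[environment] += 1
    return {env: reacted[env] / seen[env] for env in seen}


def score_atoms(atom_environments, ratios):
    """Score each atom of a query compound by the ratio of its environment,
    scaled so the most likely site in the molecule scores 1.0.
    Environments never seen in the mined data score 0.0."""
    raw = {atom: ratios.get(env, 0.0) for atom, env in atom_environments.items()}
    top = max(raw.values(), default=0.0)
    return {atom: (r / top if top else 0.0) for atom, r in raw.items()}


if __name__ == "__main__":
    ratios = occurrence_ratios(OBSERVATIONS)
    query = {"C1": "N-CH3", "C7": "aryl-CH", "C9": "unseen"}
    print(score_atoms(query, ratios))
```

The per-atom scores are what a coloring scheme like the one in the MetaPrint2D output would be derived from.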

P450  Site  of  Metabolism 

A new addition to the Schrödinger Suite of modeling
programs as of 2011 is a docking-based
P450 Site of Metabolism prediction method (an
initial version was called IDSite [56]). This is a
welcome development, inasmuch as, aside from
MetaSite, most if not all other publicly available
regioselectivity methods do not directly
incorporate structural information about the
interaction of a query compound with the CYP
binding site. The Schrödinger method relies on
an induced-fit docking protocol [57], which uses
Glide ligand docking into a flexible receptor site,
where side chain conformations are adjusted
using the homology modeling module Prime.
Available CYP structures include 2C9 and 2D6
(regioselectivity predictions can also be made for
3A4, but this model does not use docking). Each
potential site of metabolism is given an overall
score based on its accessibility to the heme in
the ensemble of docked poses and a rule-based
calculation of intrinsic reactivity [231].

Compound  interactions  with 
metabolizing  enzymes 

This general metabolism prediction end point
includes the classification of compounds as CYP
substrates, inhibitors, inducers and activators,
binding predictions for phase II and non-CYP
enzymes, and estimations of reaction rate constants
and binding affinities. Publicly available
software for predicting CYP interactions tends
to be piecemeal, with individual QSAR models
available as part of larger sets of ADMET
property predicting packages. The prediction
of whether or not a compound is a substrate
of a given CYP enzyme is useful for the CYPs
whose expression levels vary widely among
different sub-populations (e.g., 2D6). Drugs
that are substrates of CYP 3A4 can have their
metabolism rates affected by dietary compounds
such as those found in grapefruit juice. Knowing
whether a compound is a CYP inhibitor is important
for predicting drug-drug interactions. The
difference between a substrate and an inhibitor
is not always clear: generally, an inhibitor is a
compound that binds tightly enough that the
rate of metabolism of other substrates is affected.
Commonly, inhibitors are defined as compounds
with an IC50 of 10 μM or less [2].

Future Med. Chem. (2012) 4(15)

CYP induction is an important process
whereby the presence of xenobiotic compounds
can increase the expression levels of certain
CYPs, which has the effect of increasing the
clearance of substrates metabolized by those
CYPs, potentially causing drug-drug interactions
[58,59]. The induction mechanism is mediated
by nuclear receptor transcription factors,
mainly the pregnane X receptor and the constitutive
androstane receptor, which together induce
CYPs from the 2B and 3A subfamilies, as well as
the phase II UDP-glucuronosyl, sulfo- and glutathione
transferases [60,61]. In addition, the aryl
hydrocarbon receptor induces CYPs from the 1A
subfamily along with several phase II enzymes,
the peroxisome proliferator activated receptor
induces CYPs from the 4A subfamily, and the
liver X receptor induces CYPs from the 7A subfamily
[60]. Beyond the expression level-based
induction of CYP activity, CYPs can also
be activated by certain compounds that, upon
binding to the CYP enzyme itself, appear to
increase the Vmax of substrate metabolism. This
process is not well understood, and may either
involve two molecules occupying the active site
at the same time, or the presence of an allosteric
site in certain isoforms [62].

■  Databases 
ADME  DB 

ADME DB was developed by Slobodan Rendic
of Zagreb University in Croatia, in collaboration
with Fujitsu beginning in 2004 [63], and is based on
data collected from the literature on xenobiotic
interactions with CYPs and other metabolizing
enzymes [64,65]. Previously, these data had also
been used in the software tool BioFrontier/P450,
which was used to predict metabolites produced
by CYPs [232] but which is no longer commercially
available. As of April 2012, ADME DB
contains information on 27,980 CYP substrates
as well as 6125 substrates of other phase I and
phase II metabolizing enzymes: esterases,
UDP-glucuronosyltransferases, sulfotransferases,
glutathione S-transferases and flavin-containing
monooxygenases [217,218].

ADME DB can be searched by structure or
substructure, but the search results cannot be
exported in a structural format. The data for
each compound include whether it is also an
inhibitor, inducer or activator, the metabolic
reaction it undergoes, and (where available)
kinetic and other experimental information
about the reaction such as the in vitro assay model
used, Km, Vmax, Ki, IC50, EC50 and/or half-life.
Unfortunately, the license terms for this database
preclude any large-scale structure and property
extraction, as well as the public dissemination of
any models derived from the data.

PubChem 

The well-known PubChem database [66,233]
contains a series of freely available assays deposited
by the NIH Chemical Genomics Center
(NCGC), for CYP inhibitors and substrates:
1A2 [234], 3A4 [235], 2C9 [236], 2C19 [237] and
2D6 [238], along with a panel assay with all
five cytochromes [239]. The Sanford-Burnham
Medical Research Institute (USA) has also
screened for CYP 2C9 [240] and 2C19 [241] inhibitors.
The Scripps Research Institute Molecular
Screening Center (USA) has completed a set of
primary and confirmatory assays for aryl hydrocarbon
receptor activators, along with a counter
screen for activators of the pregnane X receptor
[242]. Additionally, NCGC has deposited a
screen for activators of CYP 3A4 [243], and the
Sanford-Burnham Medical Research Institute
has screened for activators of 2C9 [244] and
2C19 [245]. These newer datasets have already
been used in models included in MetaDrug and
ACD/Percepta, discussed below.

The  Comparative  Toxicogenomics  Database 

The Comparative Toxicogenomics Database
(CTD) is a meticulously curated collection of
chemicals, genes and diseases, and the relationships
and connections between them [246]. The
focus of this resource is on links between environmental
toxins and diseases in humans [67,68],
but relevant to metabolism, it includes all of
the CYP genes from hundreds of species, along
with UDP-glucuronosyltransferases, sulfotransferases,
N-acetyltransferases and so forth. The
relationships between genes and compounds are
described with a controlled vocabulary and so
can be filtered according to the type of interaction.
Along with binding, activity and expression,
interactions include metabolic processing
and a list of 35 hierarchical reaction types, such
as hydroxylation, glucuronidation and acetylation.
The entire set of chemical-gene interactions
can be downloaded, or more specific
datasets can easily be generated using batch
queries [246]. However, chemical structures are
not included and compounds are identified by
MeSH and CAS numbers, so any extracted
data would need to be collated with structural
information from another resource for model
building.

■  Software 
PASS 

Prediction of Activity Spectra for Substances
(PASS) is a QSAR modeling program that
can predict thousands of different bioactivities.
The first version of PASS was developed in
1995, at the Institute of Biomedical Chemistry,
in the Russian Academy of Medical Sciences.
PASS is now commercially available through
GeneXplain [247]. Predictions from PASS are
qualitative and reported as the probability that
each input compound is active and that it is
inactive. The descriptors used in PASS are
substructure-based, the so-called multilevel neighborhoods
of atoms descriptors [69]. Its default
models were trained on a dataset (called the SAR
Base) that has been collected from the literature
and various databases since 1972, and as of the
2011 version numbers 250,407 compounds.

PASS can qualitatively predict substrates for 97
CYP types and subtypes, inhibitors for 27 CYPs,
and inducers for 12 CYPs (without regard to
mechanism). There is also a model for predicting
aryl hydrocarbon receptor agonists, which could
also potentially be CYP inducers. Additionally,
PASS has models for 70 non-CYP metabolism
activities, including flavin-containing monooxygenase
substrates, glutathione S-transferase substrates,
monoamine oxidase substrates, peroxidase
substrates, sulfotransferase substrates, and
UDP-glucuronosyltransferase substrates, among
others. The training sets for individual models
in PASS (as part of the large default SAR Base)
may be quite small, and this of course affects the
ability of PASS to predict those activities with
high probabilities for arbitrary compounds [70].
For each compound in the query set, PASS notes
whether it contains substructure descriptors that
were not seen in the training set, giving a further
estimate of whether or not the model is likely to
be predictive.

StarDrop 

In a separate functionality from its regiolability
predictions, StarDrop from Optibrium [209] has
two QSAR models for CYP-binding affinity:
a continuous model for 2C9 and a four-class
classification model for 2D6. These models
were trained on small sets of in vitro data collected
in-house, and built using physicochemical
descriptors along with atom type and functionality
counts. The 2D6 classification model
was built using decision trees and the 2C9 pKi
model uses rule-based partial least squares equations.
These models do not attempt to predict
whether a compound will be a CYP substrate,
only what its affinity will be if it is a substrate;
their purpose is to predict potential drug-drug
interactions. StarDrop determines whether an
input compound fits into the chemical descriptor
space of the models, and if not, an error of
prediction with a value of infinity is returned.
Otherwise, an estimate of the root-mean-square
error of prediction is given [248]. However, the
error of prediction is not shown by default in the
results window, so users must exercise caution.

Accelrys 

An ADME toolkit called C2.ADME was
released for version 4.6 of the Cerius2 modeling
program from Accelrys in 2001 [249]. This toolkit
included a QSAR model for CYP 2D6 inhibition,
based on a training set of 100 compounds
from the literature. These compounds were collected
using what the authors referred to as a
‘fairly relaxed set of criteria’ so that the training
set was diverse in terms of both chemistry and
bioactivity. The models were trained using an
ensemble recursive partitioning technique and
Cerius2 topological and atom type descriptors
[71]. Currently the ADME toolkit is available in
Discovery Studio [250,251] and as a component in
the Pipeline Pilot ADME-Tox collection [221,252].

ADMEWORKS  Predictor 

ADMEWORKS Predictor, from Fujitsu, contains
a set of QSAR models derived from the
data in the ADME DB. It can predict Km for
CYP 2D6 and 3A4, as well as Ki for CYP 3A4
inhibition [253]. Models for predicting kinetics
and inhibition for other CYP types are under
development. It is run over a web server located
in Poland and, compared with a locally run
program, the performance is quite slow.

MetaDrug

MetaDrug, from GeneGo [204], includes a set of
CYP substrate and inhibition QSAR models that
can be applied both to input compounds and to
the full set of their predicted metabolites. These
are binary classification models for substrates of
1A2, 2B6, 2D6 and 3A4, and for inhibitors (at
IC50 <10 μM) of 1A2, 2C19, 2C9, 2D6 and 3A4.
The substrate models were trained on literature
data from MetaBase, and the inhibitor models
were trained on data from the cytochrome panel
assay in PubChem submitted by NCGC [239],
using recursive partitioning methods and physicochemical
property descriptors [72]. MetaDrug
also has a classification model for pregnane X
receptor activation, based on literature data from
MetaBase, which could identify some CYP-inducing
compounds, and a quantitative prediction
of the pIC50 for inhibition of human soluble epoxide
hydrolase, an important phase II enzyme.
For the models trained on MetaBase data, the
training set can be viewed with a single click from
each model description, which also includes the
number of molecules in the training and test
sets, and the statistical parameters for sensitivity,
specificity, and accuracy. Model applicability
domains are calculated using Tanimoto prioritization,
which provides a measure of the similarity
between the query compound and its most similar
training set compound.
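
A nearest-neighbor Tanimoto check of this kind can be sketched as follows. This is a generic illustration, not MetaDrug's implementation: fingerprints are represented as sets of on-bit indices, the example bit sets are invented, and the 0.5 cutoff is a hypothetical value chosen only for the demonstration.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient of two fingerprints given as sets of on-bit
    indices: shared bits divided by bits set in either fingerprint."""
    a, b = set(fp_a), set(fp_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def nearest_training_similarity(query_fp, training_fps):
    """Similarity between the query and its most similar training compound."""
    return max((tanimoto(query_fp, fp) for fp in training_fps), default=0.0)


def in_domain(query_fp, training_fps, threshold=0.5):
    """Hypothetical applicability rule: trust the model only if some
    training compound is at least `threshold` similar to the query."""
    return nearest_training_similarity(query_fp, training_fps) >= threshold


if __name__ == "__main__":
    training = [{1, 2, 3, 4}, {10, 11, 12}]
    print(nearest_training_similarity({2, 3, 4, 5}, training))
    print(in_domain({20, 21}, training))
```

Reporting the similarity itself, rather than only a pass or fail flag, lets the user judge how far a query lies from the training data.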

MetaDrug is run through a web interface.
Input compounds can be uploaded in SD format,
and the results can be viewed online or exported
as an Excel file. MetaDrug is not intended for
analysis of large datasets: the QSAR model predictions
can only be run on files with fewer than
500 compounds, and the calculations performed
on enumerated metabolites can only be run if the
input file has fewer than 12 compounds.

isoCYP 

isoCYP was released in 2007, and uses a simple
model to classify input compounds as metabolized
primarily by either CYP 3A4, 2D6, or
2C9 [254]. The training set for the model was
taken from an earlier study from researchers at
Liverpool John Moores University (UK), who
had compiled a test set of known drugs [73] and
a training set from the literature, and used these
to construct recursive partitioning models for
CYP substrate classification [74]. The structures
of these compound sets were listed in SMILES
format in a table in the manuscript. The program
isoCYP is an attempt by researchers at
Molecular Networks to improve the predictivity
of the CYP substrate classification models, and
to market the result, using molecular properties
as descriptors and a support vector machine
for classification. A set of compounds from the
Accelrys Metabolite database [201] was also used
as an external test set [75].

isoCYP is intended to be used for rapid filtering
of large datasets, and can be run from the
command line, from a graphical user interface,
or through a component for Pipeline Pilot. It
can also be tested free of charge for small sets of
compounds on the Molecular Networks website
[254]. The output for each compound is simply
the name of the CYP for which it is predicted to
be the most likely substrate. It does not take into
account the possibility of overlapping substrates,
or the possibility of non-CYP metabolism [75].

ADMET  Predictor 

In addition to its metabolic site predictions,
the Metabolism Module in ADMET Predictor
from Simulations Plus has classification models
for substrate binding to CYPs 1A2, 2A6, 2B6,
2C8, 2C9, 2C19, 2D6, 2E1 and 3A4. There
are also classification models for inhibition of
five CYPs (1A2, 2C9, 2C19, 2D6 and 3A4),
as well as specific Ki values for the inhibition
of 3A4-mediated metabolism of midazolam
and testosterone. These models were released
in 2008 and developed in collaboration with
Enslein Research, who collected and curated the
literature data upon which the models are built
[76]. Additionally, ADMET Predictor includes
QSAR classification models for nine human
UDP-glucuronosyltransferase isozymes. These
can predict whether or not a compound will
undergo a glucuronidation reaction before being
oxidized in a phase I reaction [213]. All input compounds
are checked to determine whether their
molecular properties fall within the descriptor
space of each model, and if not, the prediction is
considered to be outside the scope of the model
and flagged in the output [212].
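
One simple way to implement a descriptor-space check is a per-descriptor range test against the training set. The sketch below assumes that approach for illustration only; how ADMET Predictor actually defines its descriptor space is not described here, and the descriptor names and values are invented.

```python
# Hypothetical training-set descriptors (illustration only).
TRAINING_ROWS = [
    {"mol_weight": 180.2, "logP": 1.2, "h_bond_donors": 1},
    {"mol_weight": 412.5, "logP": 3.8, "h_bond_donors": 3},
    {"mol_weight": 305.4, "logP": 2.1, "h_bond_donors": 0},
]


def descriptor_ranges(training_rows):
    """Minimum and maximum of every descriptor over the training set."""
    names = training_rows[0].keys()
    return {
        name: (
            min(row[name] for row in training_rows),
            max(row[name] for row in training_rows),
        )
        for name in names
    }


def out_of_scope(query, ranges):
    """Names of the descriptors for which the query compound falls outside
    the training range; an empty list means the query is inside the box."""
    return [
        name for name, (low, high) in ranges.items()
        if not low <= query[name] <= high
    ]


if __name__ == "__main__":
    ranges = descriptor_ranges(TRAINING_ROWS)
    query = {"mol_weight": 520.0, "logP": 2.5, "h_bond_donors": 2}
    flagged = out_of_scope(query, ranges)
    print("out of scope:" if flagged else "in scope", flagged)
```

A bounding-box test like this is cheap but coarse: a compound can sit inside every individual range and still be unlike any training compound, which is why similarity-based checks are often used alongside it.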

ACD/Percepta 

ACD/Percepta has QSAR models for the prediction
of substrates and inhibitors for CYPs
1A2, 2C19, 2C9, 2D6 and 3A4. The models
were trained on the same data used for the
ACD/Percepta regiolability models, along with some
additional data from known drugs. The inhibition
models also used data from PubChem
bioassays by NCGC [36]. The output for inhibitors
is presented as the probability that a compound
will bind with IC50 <50 μM and with
IC50 <10 μM [227]. As with the regiolability
models, a reliability measure is reported for
the prediction. Additionally, the models can be
revised and expanded to include any new experimental
data that might be generated by the user
to increase their coverage of chemical space and
therefore the applicability domain [77].

VirtualToxLab 

VirtualToxLab is unique among CYP binding
prediction software programs in that it incorporates
either an automated docking run or a
ligand-based pharmacophore alignment to produce
a set of predicted binding modes against
the target of interest. For each target, a multidimensional
QSAR model has been developed,
consisting of a linear regression equation with
atomistic descriptors, to convert scores calculated
from the docked or aligned poses into predicted
binding affinities [78,255]. VirtualToxLab
has models for CYPs 1A2, 2A13, 2D6, 2C9 [79]
and 3A4 [80], as well as other anti-targets that
are involved in receptor-mediated toxicity reactions.
A ‘toxic potential’ can then be calculated
for each input compound based on its predicted
binding affinities to all the models [78]. Three
of these anti-targets are potentially involved in
CYP induction: the aryl hydrocarbon [81], PPAR
gamma [82] and liver X receptors [83].

VirtualToxLab was developed by the
Biographics Laboratory of the 3R Research
Foundation in Switzerland, whose goal is to
find alternatives to animal experimentation.
A version of VirtualToxLab is available (with
some reduced functionality) for free to academic
and nonprofit researchers. It is run with
a Java interface, using the Java OpenGL library,
installed on the client computer and communicating
over an SSH connection with the remote
server in Switzerland. This graphical interface
allows the preparation of 3D structures for the
input compounds, and the visualization of the
docked poses for each compound. Due to the
computationally intensive nature of the calculations,
they take several hours per molecule per
target, and only one molecule can be run at a
time; however, a queuing system allows multiple
molecules to be submitted in the same session.

■  Predictions 

We  collected  from  the  Drugs@FDA  website 
[256]  a  small  test  set  of  ten  molecules  that  were 
very  recently  (within  the  last  year)  approved  as 
drugs  (new  molecular  entities).  CYP  substrate, 
inhibition,  and  induction  data  were  extracted 
manually  from  the  reviews  included  in  the  drug 
approval  packages.  Basic  information  about  each 
drug  is  listed  in  Table  3,  and  structures  are  given 
in  Table  4.  We  attempted  to  include  a  variety  of 
chemotypes  and  therapeutic  indications. 

Here, the goal was not to provide a rigorous
quantitative benchmark of the accuracy of


Table 3. Recently approved drugs used as a small test set.

| Name | Trade name | Company | New drug application | Approval | Mechanism | Target | Disease |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Axitinib | Inlyta® | Pfizer | 202324 | 27 January 2012 | Kinase inhibitor | VEGFR | Renal cell carcinoma |
| Vismodegib | Erivedge™ | Genentech | 203388 | 30 January 2012 | Hedgehog pathway inhibitor | Smoothened GPCR | Basal cell carcinoma |
| Ivacaftor | Kalydeco™ | Vertex | 203188 | 31 January 2012 | CFTR potentiator | CFTR chloride channel | Cystic fibrosis |
| Ruxolitinib | Jakafi® | Incyte/Novartis | 202192 | 16 November 2011 | Kinase inhibitor | JAK1, JAK2 | Myelofibrosis |
| Crizotinib | Xalkori® | Pfizer | 202570 | 26 August 2011 | Kinase inhibitor | ALK, c-Met, RON | Non-small-cell lung cancer |
| Indacaterol | Arcapta™ | Novartis | 022383 | 1 July 2011 | Long-acting β2 agonist | β2 adrenergic receptor | Chronic obstructive pulmonary disease |
| Ticagrelor | Brilinta® | AstraZeneca | 022433 | 20 July 2011 | Platelet aggregation inhibitor | Platelet P2Y12 ADP-receptor | Acute coronary syndrome |
| Ezogabine | Potiga™ | GlaxoSmithKline/Valeant | 022345 | 10 June 2011 | Potassium channel opener | KCNQ/Kv7 | Epilepsy |
| Linagliptin | Tradjenta® | Boehringer Ingelheim | 201280 | 2 May 2011 | Peptidase inhibitor | DPP-4 | Type 2 diabetes |
| Abiraterone | Zytiga® | Janssen | 202379 | 28 April 2011 | Androgen biosynthesis inhibitor | CYP17 | Prostate cancer |

Table 4. Test set structural information.

| Name | SMILES | InChI Key |
| --- | --- | --- |
| Axitinib | C2=C(SC1=CC=CC=C1C(=O)NC)C=CC3=C2[NH]N=C3C=CC4=CC=CC=N4 | RITAVMQDGBJQJZ-UHFFFAOYSA-N |
| Vismodegib | C1=C(Cl)C(=CC=C1[S](=O)(=O)C)C(=O)NC2=CC=C(Cl)C(=C2)C3=CC=CC=N3 | BPQMGSKTAYIVFO-UHFFFAOYSA-N |
| Ivacaftor | CC(C)(C)C1=CC(=C(C=C1NC(=O)C3=CNC2=CC=CC=C2C3=O)O)C(C)(C)C | PURKAOJPTOLRMP-UHFFFAOYSA-N |
| Ruxolitinib | C1=C[NH]C2=NC=NC(=C12)C3=C[N](N=C3)[C@H](CC#N)C4CCCC4 | HFNKQEVNSGCOJV-OAHLLOKOSA-N |
| Crizotinib | [C@H](OC1=CC(=CN=C1N)C2=C[N](N=C2)C3CCNCC3)(C4=C(Cl)C=CC(=C4Cl)F)C | KTEIFNKAUNYNJU-GFCCVEGCSA-N |

prediction  for  each  software  program  because 
this  is  a  very  small  set  of  compounds  and, 
although  they  are  new  drugs,  all  of  them  have 
been  known  in  the  literature  for  many  years  so 
we  cannot  guarantee  that  none  of  them  is  pres¬ 
ent  in  any  of  the  model  training  sets.  Rather,  we 
intended  to  give  a  flavor  of  the  range  of  enzyme 


interactions  that  could  be  calculated,  the  confi¬ 
dence  given  by  each  program  in  its  predictions, 
and  a  sense  of  in  which  contexts  each  program 
might  be  useful. 

We  then  ran  predictions  of  the  metabolizing 
enzyme  interactions  against  this  test  set  from  all 
of  the  programs  discussed  in  this  section,  with 
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Table  4.  Test  set  structural  information  (cont.). 


Name  Structure 


SMILES 


InChl  Key 


[C@@H](0)(C1  =C2C(=C(0)C=C1)  QZZUEBNBZAPZLX- 

NC(=0)C=C2)CNC3CC4  =  C(C3)  QFIPXVFZSA-N 

C=C(C(=C4)CC)CC 


Ticagrelor 


[C@@H]4([C@H](NC1  =  NC(=NC2=C1  OEKWJQXRCDYSHL- 
N  =  N[N]2[C@H]3[C@H](0)[C@H](0)  FNOIDJSQSA-N 
[C@@H](OCCO)C3)SCCC)C4) 

C5  =  CC(=C(F)C=C5)F 


Ezogabine 


Cl  =CC(=CC(=C1NC(0CC)  =  0)N)  PCOBBVZJEWWZFR- 

NCC2=CC=C(C=C2)F  UHFFFAOYSA-N 


Linagliptin 


Abiraterone 


[C@H]5(N)  LTXREWYXXSTFRX- 

CN(C1  =NC2=C([N]1CC#CC)C(=0)  QGZVFWFLSA-N 

N(C(=0)N2C)CC3  =  NC(=C4C(=N3) 

C=CC=C4)C)CCC5 


[C@@H]4([C@H](NC1=NC(=NC2=C1  UVIQSJCZCSLXRZ- 
N  =  N[N]2[C@H]3[C@H](0)[C@H](0)  UBUQANBQSA-N 
[C@@H](OCCO)C3)SCCC)C4) 

C5  =  CC(=C(F)C=C5)F 


the  exception  of  ADMEWORKS  Predictor 
(because  the  trial  version  calculates  only  a 
small  subset  of  the  available  properties)  and 
the  CYP  2D6  inhibition  model  in  the  Accelrys 
ADME  toolkit.  Versions  used  were:  PASS  2011 
Professional,  StarDrop  5.0,  the  online  demo  ver¬ 
sion  of  isoCYP,  MetaDrug  (versionless,  2012), 


ADMET  Predictor  6.0,  and  VirtualToxLab  4.2. 
The  amount  of  computer  time  required  to  run 
the  predictions  ranged  from  a  few  seconds 
for  the  programs  with  simple  QSAR  models, 
to  1  h  for  MetaDrug,  to  over  a  week  for  the 
computationally  intensive  docking  runs  of 
VirtualToxLab. 
Figure 1. Predicted metabolizing enzyme interactions with the test set (Table 3). Checkmarks indicate correct predictions, crosses indicate incorrect predictions and empty squares mean no prediction of the interaction was made. Confident predictions are shaded dark gray and uncertain predictions are shaded light gray.

AP: ADMET Predictor 6.0; iC: isoCYP; MD: MetaDrug; PC: ACD/Percepta; PS: PASS 2011; SD: StarDrop 5.0; VT: VirtualToxLab 4.2.

The results of the predictions are shown in Figure 1. For PASS, all predictions where the probability that the compound is active was greater than the probability that it is inactive (Pa > Pi) are included. Uncertain predictions are defined as those where Pa < 0.5. For StarDrop, CYP 2C9 inhibition was defined as a predicted pKi > 6, and 2D6 inhibitors were defined as those with predicted affinity high or very high. Uncertain predictions are those with an infinite error of prediction for 2C9 and those where the probability for each class of 2D6 inhibition is equal. In MetaDrug, substrates/inhibitors with a calculated QSAR value > 0.5 were considered active, and uncertain predictions were those with a Tanimoto prioritization < 50%. ADMET Predictor provides simple yes/no predictions for CYP substrates and inhibitors. In ACD/Percepta, active substrates and inhibitors were considered to be those with a probability > 0.5 for an IC50 < 50 μM, and uncertain predictions were those with a reliability index < 0.3. In VirtualToxLab, if binding was predicted to be medium, high or very high, the compound was considered to be an inhibitor or inducer.
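As an illustration only, the cutoffs described above for PASS, MetaDrug and ACD/Percepta can be written as a short Python sketch. The class, function and parameter names are our own assumptions and do not correspond to any interface of these programs; the sketch only encodes the thresholds stated in the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Call:
    """One classified prediction (hypothetical container, not a vendor type)."""
    active: Optional[bool]  # None means the program reported no prediction
    confident: bool


def pass_call(pa: float, pi: float) -> Call:
    """PASS: a prediction is included only when Pa > Pi; uncertain when Pa < 0.5."""
    if pa <= pi:
        # By default PASS does not list such results, so we record no prediction.
        return Call(active=None, confident=False)
    return Call(active=True, confident=pa >= 0.5)


def metadrug_call(qsar_value: float, tanimoto_percent: float) -> Call:
    """MetaDrug: active when QSAR value > 0.5; uncertain when Tanimoto < 50%."""
    return Call(active=qsar_value > 0.5, confident=tanimoto_percent >= 50.0)


def percepta_call(probability: float, reliability_index: float) -> Call:
    """ACD/Percepta: active when probability > 0.5; uncertain when RI < 0.3."""
    return Call(active=probability > 0.5, confident=reliability_index >= 0.3)
```

Keeping "no prediction" (None) distinct from "predicted inactive" (False) mirrors the empty squares in Figure 1, which are counted differently from crosses.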

We did not observe any overwhelming difference in accuracy between the different software programs for our small test set, although of course, it bears re-emphasizing that we are presenting only a qualitative observation and not a rigorous statistical test. The most accurate results (defined as the number of correct predictions divided by the total number of predictions) were achieved by isoCYP, which predicted that all of the drugs would be metabolized by CYP 3A4, and in fact all of them are, with the exception of ezogabine, which is not a CYP substrate at all. VirtualToxLab and StarDrop both seem to perform well, with accuracies of approximately 75%. The remaining programs averaged approximately 60% accuracy.

We also calculated the sensitivity and the specificity for each program. Sensitivity is calculated as the number of true positives divided by the sum of true positives and false negatives, and defines the program’s ability to correctly make positive predictions (e.g., ivacaftor is a substrate of CYP 3A4). Specificity is calculated as the number of true negatives divided by the sum of true negatives and false positives, and defines the program’s ability to correctly make negative predictions (e.g., linagliptin does not induce CYP 2D6). Here, the differences between the programs were more pronounced.
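The three measures defined above can be sketched in a few lines of Python. This is a minimal illustration; the function names are assumptions of ours, and None is used (as a convention of this sketch) for interactions where a program made no prediction, which are excluded from the counts.

```python
def confusion_counts(predicted, observed):
    """Count true/false positives and negatives; None predictions are skipped."""
    tp = fp = tn = fn = 0
    for pred, obs in zip(predicted, observed):
        if pred is None:
            continue  # no prediction made for this interaction
        if pred and obs:
            tp += 1
        elif pred and not obs:
            fp += 1
        elif not pred and not obs:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def _ratio(numerator, denominator):
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def accuracy(tp, fp, tn, fn):
    """Correct predictions divided by the total number of predictions."""
    return _ratio(tp + tn, tp + fp + tn + fn)


def sensitivity(tp, fn):
    """True positives divided by (true positives + false negatives)."""
    return _ratio(tp, tp + fn)


def specificity(tn, fp):
    """True negatives divided by (true negatives + false positives)."""
    return _ratio(tn, tn + fp)
```

For example, with made-up inputs `predicted = [True, True, False, None, False, True]` and `observed = [True, False, False, True, True, True]`, the counts are two true positives, one false positive, one true negative and one false negative.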

PASS has the most available models for metabolizing enzyme interactions out of all the programs. However, by default, PASS does not show negative results (i.e., that an input compound is not an inhibitor of a given enzyme). If the result of a prediction is not on the list, then we know the probability of that activity is less than the predicted probability of its inactivity (Pa < Pi), but we do not know how confident that prediction is.

ACD/Percepta is very conservative in expressing confidence in its predictions. The developers regard a reliability index > 0.7 as a good prediction, and a reliability index between 0.3 and 0.7 as borderline. However, with our test set none of the predictions had a reliability index > 0.7, and only two predictions were > 0.5. We therefore used the bottom of the borderline, 0.3, as the cutoff, but even here only a third of the predictions were confident, the lowest ratio of all the programs. Generally speaking, the accuracy of the confident predictions in the other software programs was slightly higher. The greatest jump in accuracy between all predictions and only confident predictions was seen with MetaSite at 14% improvement.

This simple set of test results serves to highlight some of the difficulties faced by bench chemists or modelers without a particular interest in the gritty details of metabolic enzyme interactions or the pitfalls of QSAR modeling, who might only want a reasonably accurate prediction of the potential for toxicity or unfavorable drug-drug interactions as a filter in a screening cascade, or as a check on a set of analogs proposed for synthesis. Is a 60% accuracy rate really any kind of significant improvement over a random selection of compounds? For a single compound or a small set proposed for synthesis, in the absence of any prior knowledge about the metabolic behavior of the chemotype, such a prediction is essentially meaningless. One might well be better off generalizing from a set of simple rules of thumb [84]. Conversely, some of the programs (PASS, StarDrop, and VirtualToxLab) do show sensitivities in the 80-90% range, suggesting that they may be useful in a situation where false positives can be tolerated but false negatives (i.e., missing a good compound) are undesirable. As a filter to eliminate potentially problematic compounds, isoCYP, StarDrop, or ADMET Predictor would seem to be good choices for rapidly, if not entirely accurately, predicting whether a compound is likely to be a substrate or an inhibitor of CYP 2C9 or 2D6. MetaDrug also shows reasonable specificity, but it is too slow to be used for a large database of compounds. Figure 1 also highlights, in its expanses of empty uncharted space, the lack of available computational predictions for most of the measurements that would be required by the FDA for any new drug application. This is summarized in the form of a calculated predictivity for each program, meaning the number of correct predictions divided by the total number of datapoints we had collected for all compounds. With the exception of PASS, all the programs have a predictivity of less than one-third.
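To make the distinction between accuracy and predictivity concrete, and to give one rough way of asking whether a given hit rate differs from guessing, a minimal Python sketch follows. The function names are ours, and the 50% chance level in the binomial calculation is an assumption that only holds when active and inactive outcomes are equally likely; it is not a value taken from our test set.

```python
from math import comb


def predictivity(n_correct: int, n_datapoints: int) -> float:
    """Correct predictions divided by ALL collected datapoints,
    including those for which the program offered no prediction."""
    return n_correct / n_datapoints


def binomial_tail(k: int, n: int, p: float = 0.5) -> float:
    """Probability of at least k successes in n independent trials,
    each succeeding with probability p (assumed chance level)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))
```

With hypothetical numbers, 6 correct calls out of 10 would be reached or exceeded by coin-flipping about 38% of the time (`binomial_tail(6, 10)`), which illustrates why a 60% accuracy on a handful of compounds carries little weight.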

Metabolic stability

The total body clearance and/or the half-life of a xenobiotic compound (these are inversely proportional to one another, with the volume of distribution as the scaling factor) is a complicated end point to model because it involves multiple enzymatic reactions and depends on factors such as the extent of plasma protein binding and the involvement of active transport across membranes. To simplify the situation, the intrinsic clearance of a compound can be considered. This consists of the portion of the total clearance that can be attributed to the removal of drug from the blood by the liver [85]. Intrinsic clearance can be approximated using one or a series of in vitro assays, including kinetics with recombinant CYPs expressed in baculoviruses, and half-life measured in human liver microsomes, liver S9 fraction, or intact hepatocytes. These in vitro assays are generally undertaken in a relatively high-throughput manner as an early screen during drug discovery, and referred to as metabolic stability assays [86,87].
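The relationships mentioned above can be sketched with two textbook formulas: the one-compartment relation t1/2 = ln 2 × Vd / CL, and the commonly used in vitro half-life approach, in which intrinsic clearance is estimated as (ln 2 / t1/2) × (incubation volume / amount of microsomal protein). The Python below is a minimal illustration under those assumptions; the function names, units and example values are ours and are not drawn from any of the resources reviewed here.

```python
import math


def half_life_h(volume_of_distribution_l: float, clearance_l_per_h: float) -> float:
    """One-compartment half-life (h) from volume of distribution (L)
    and total clearance (L/h): t1/2 = ln(2) * Vd / CL."""
    return math.log(2) * volume_of_distribution_l / clearance_l_per_h


def microsomal_clint(t_half_min: float, incubation_volume_ul: float,
                     protein_mg: float) -> float:
    """In vitro intrinsic clearance (uL/min/mg protein) from the substrate
    depletion half-life (min) measured in a microsomal incubation."""
    return (math.log(2) / t_half_min) * (incubation_volume_ul / protein_mg)


# Hypothetical example values, for illustration only.
print(round(half_life_h(70.0, 7.0), 2))                # half-life in hours
print(round(microsomal_clint(30.0, 1000.0, 0.5), 1))   # uL/min/mg protein
```

Scaling such an in vitro value up to a whole-liver clearance requires further physiological scaling factors, which are outside the scope of this sketch.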

It should be emphasized that the in vitro assays are themselves models, requiring various kinds of mathematical extrapolations to relate the in vitro measurements to in vivo clearance data. While there is definitely a strong correlation between in vitro and in vivo clearance, there is some concern about how well these in vitro assays can capture the complexities of clearance in vivo [86,88]. In attempting to predict, via in silico methods, an in vitro end point such as stability in human liver microsomes, one is in fact building a model of a model. Nevertheless, in silico methods may be useful as a means of filtering larger libraries of compounds much more cheaply and quickly than with in vitro screening methods.

■ Databases
BioPrint

The BioPrint database, developed in 1999, is produced by Cerep. Rather than data extracted from the literature, BioPrint contains data measured in-house in Cerep’s own set of 159 standardized assays. As of 2012, the database consists of a set of approximately 2500 known drugs and reference compounds. The assays relevant to metabolism are metabolic stability measured in human liver microsomes, and inhibition of CYPs 1A2, 2B6, 2C8, 2C9, 2C19, 2D6, 2E1, 3A4 and 3A5 [257]. QSAR models using the in vitro assay results as descriptors along with 3D pharmacophore-based molecular descriptors have been constructed by Cerep to allow the prediction of in vivo end points such as adverse drug reactions once a new compound has been run through the panel of assays [89]. Cerep sells this type of profiling as a service, but the BioPrint database itself is also available as a commercial standalone product.

WOMBAT-PK

World Of Molecular Bioactivity (WOMBAT)-PK was first released in 2005 by Sunset Molecular Discovery, as a subset of the WOMBAT database, to integrate medicinal chemistry knowledge with pharmacokinetics data [90]. The 2010 version of WOMBAT-PK contains 1260 drugs from textbooks, FDA databases and the literature. Data fields relevant to metabolism include systemic clearance, apparent systemic clearance, non-renal clearance, half-life, terminal half-life and phase I metabolizing enzymes. However, these metabolism and clearance data are not available for every drug in the database [258].

DrugBank

DrugBank, from the University of Alberta (Canada) [259], is a well-known free database of 6711 drug compounds in version 3.0 (approved drugs, nutraceuticals and experimental drugs). It links cheminformatics and bioinformatics by including target, metabolizing enzyme and transporter information for each drug. For a subset of the drugs, metabolism data such as half-life and clearance pathways are included, but this information is presented in free text format as extracted from journal articles [91].

AurSCOPE ADME

The AurSCOPE ADME database from Aureus Sciences consists of a set of 40,635 compounds (as of June 2012) collected from the literature, patents, known drugs and new drug applications in the USA and Europe. The metabolism-related properties that are included are binding affinity and kinetic data for various phase I and II enzymes, along with human liver microsome and hepatocyte stability data [260]. Recently, Aureus has collaborated with the Institute of Biomedical Chemistry to develop AurPASS, in which the PASS modeling methods are used with the AurSCOPE database to predict a variety of biological activities [261].

Evolvus

The company Evolvus [262], which specializes among other things in chemical data curation, offers a hepatic clearance/microsomal stability dataset, containing, as of May 2010, approximately 5000 compounds tested in humans and other mammalian species. The data have been collected from the recent literature (since 2002).

Other free databases

Freely available resources include two datasets for clearance and half-life, extracted from the Goodman & Gilman textbook [92] and available at QSAR World [263]. The ChEMBL database of bioactivity data at the European Bioinformatics Institute [93,264] contains, in version 14, over 9000 half-life or metabolic stability-related assays from the literature, in particular a relatively large (669 compounds) set of human intravenous administration half-life data [94]. PubChem [66,233] contains two assays from the Conrad Prebys Center for Chemical Genomics at the Sanford-Burnham Medical Research Institute, measuring metabolic stability in human and mouse hepatic microsomes [265,266]. At the time of writing, these assays are set up as placeholders containing one compound each, but a full set of data is planned to be released. These last four resources are discussed in more detail in the accompanying paper [15], where we use them to build QSAR models of half-life in human liver microsomes.

■ Software
ACD/Percepta

Aside from the databases containing half-life and clearance information, mainly for relatively small sets of known drugs, there are very few publicly available resources useful in the context of metabolic stability predictions. ACD/Percepta [224] has, to our knowledge, the only explicit prediction of half-life in human liver microsomes. It uses a random forest classification model, trained on literature data, with its CYP substrate and regioselectivity predictions (discussed above) as descriptors. Input compounds are classified as ‘stable’, ‘unstable’ or ‘undefined’, where the latter can either mean that the compound is of intermediate stability or that the reliability of the prediction is too low. Users can adjust the range of half-life times that should be considered ‘undefined’.

ADMET Predictor

In ADMET Predictor, the Metabolism Module (discussed in more detail above) can predict Km and Vmax for hydroxylation reactions by five recombinant CYPs (1A2, 2C19, 2C9, 2D6 and 3A4) and converts these into an intrinsic clearance value for each CYP. It can also provide a more general ‘CYP_Risk’ score for a query compound, based on predictions of high clearance by any of the CYPs along with inhibition of 3A4 [213].
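For readers unfamiliar with how kinetic constants relate to intrinsic clearance, the usual Michaelis-Menten approximation at substrate concentrations well below Km is CLint = Vmax / Km. The Python sketch below applies that textbook relation per enzyme; it is not a description of ADMET Predictor's internal method, and the function names, units and example numbers are hypothetical.

```python
def intrinsic_clearance(vmax: float, km: float) -> float:
    """CLint = Vmax / Km, valid when substrate concentration << Km.

    With Vmax in nmol/min/nmol CYP and Km in uM (nmol/mL),
    the result is in mL/min/nmol CYP.
    """
    if km <= 0:
        raise ValueError("Km must be positive")
    return vmax / km


def per_cyp_clearance(kinetics: dict) -> dict:
    """Map each CYP name to its CLint, given {name: (vmax, km)}."""
    return {cyp: intrinsic_clearance(vmax, km)
            for cyp, (vmax, km) in kinetics.items()}


# Hypothetical kinetic constants, for illustration only.
example = {"3A4": (10.0, 5.0), "2D6": (2.0, 20.0)}
print(per_cyp_clearance(example))
```

Comparing the per-enzyme values in this way indicates which isoform would be expected to dominate clearance under the stated low-concentration assumption.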

MEXAlert

The program MEXAlert from CompuDrug, a simplified, high-throughput version of MetabolExpert (discussed above), can be used to flag compounds that are likely to undergo extensive first-pass metabolism, and thus be rapidly eliminated before reaching the systemic circulation [267]. Rather than an extensive graphical tree of metabolites, MEXAlert produces a simple text table in which each input compound is flagged as ‘Probable’ or ‘Not Probable’ to be metabolized, along with a list of predicted biotransformation reactions.
QikProp

An interesting approach to metabolic stability prediction is found in the program QikProp from Schrödinger, first released in 2000, and developed by the Jorgensen group at Yale University (USA) [268]. QikProp can calculate a descriptor called #metabol, which tests whether an input compound can undergo any of a set of 21 metabolic reactions. The output is a number between 0.0 and 8.0; a higher number indicates that a compound is more likely to be metabolically unstable. The development of this property has not, to our knowledge, been described in any of the publications about QikProp, so it is not clear how well the number of potential reactions a compound can undergo correlates with its actual reactivity or metabolic stability. In a QSAR modeling study on the metabolic stability of calcitriol analogs, using QikProp properties among others, Jensen et al. evaluated several variable selection methods, and found that while the #metabol descriptor was chosen some of the time, it was not ultimately one of the most predictive descriptors [95].

Holes in the bucket

A caveat of all 2D structure-based approaches, such as QSAR models, is the inability to handle enantiomer-specific metabolism. This may not be a big issue, as many pairs of stereoisomers are observed to give rise to the same metabolites [45]. However, substrate stereochemistry has been known to play an important role in CYP-mediated metabolism. For example, 2C19 is known to catalyze 6- and 8-hydroxylation, but not 4'-hydroxylation, of R-warfarin. On the other hand, 2C19 catalyzes 4'-hydroxylation but not 6- and 8-hydroxylation of S-warfarin [96]. Tools that explicitly consider 3D molecular conformations of both the substrate and the proteins are required for predicting stereochemistry-specific metabolism. These tools are much more resource-demanding.

One of the drawbacks to biotransformation prediction methods is that they tend to lump together biotransformation data from different species and tissue types [23,29] (an exception to this is TIMES, though its liver simulator is built on data from humans, rats and dogs [40]). Also, in spite of attempts in each software program to prune the resulting metabolic trees, a great many metabolites are still generated for each input compound, not all of which will be observed experimentally [29]. Because of this, these programs are difficult to apply in a high-throughput fashion. The advantages of biotransformation prediction methods are that, unlike other methods discussed herein, they are not focused on CYPs and so can predict metabolites produced via other metabolizing enzymes. Furthermore, the combinatorial enumeration of metabolites can be useful in the context of identifying the compounds present in mass spectrometry data [29].

The drawbacks of metabolic site prediction methods are that they generally cannot evaluate the reactivity of different molecules relative to one another, only different sites within the same molecule [42]. They also cannot generally predict which specific CYP will metabolize a compound, or even whether or not the input query compound will actually be metabolized, only where it would be metabolized if it were a substrate (exceptions to this are StarDrop, with its composite lability prediction [210], and ADMET Predictor, which prefaces its site predictions with a substrate classification prediction [213]). Metabolic site prediction methods can be useful for optimization of a given lead compound to improve its metabolic stability [2,13], but generally not for selecting a lead compound out of a larger set of screening hits, or for database filtering prior to sample acquisition or synthesis.

Ideally, one would like to be able to predict the rate of metabolism for an input compound, compared with other compounds in the dataset under analysis. Rates of metabolic reactions are important because the extent of metabolism of a compound and which sites are predominantly metabolized are (often) kinetically driven. Rate predictions are also a way of ranking the extent of reactivity of one compound against another, rather than simply ranking the sites within a single compound. In spite of the importance of this parameter, there are very few software packages that address metabolic reaction rates, probably because of a lack of good consistent training data.

There are also very few software tools for predicting CYP induction, and none (to our knowledge) for CYP activation prediction. This is probably a function of the relative newness of this field (the nuclear receptor mechanism for CYP induction was only discovered in 1998 [58]) and of the complexity of the underlying mechanisms, confounding the development of structure-activity relationships.

The prediction of phase II and other non-CYP substrates is another area of metabolism prediction where there is very little publicly available software. This lack of models for predicting substrates and inhibitors of non-CYP metabolism enzymes is concerning, particularly in light of a recent meta-analysis of drug metabolism pathways and metabolites extracted from current literature. Testa et al. found that a full 40% of the reactive toxic metabolites described were quinones and analogs thereof (quinonimines, quinonimides, quinone-diimines) [97]. These can be produced by CYPs but also by peroxidases, an enzyme family that has essentially been neglected in the field of metabolism prediction, and whose tissue distribution and substrate specificity are very different from those of CYPs. Fortunately, public data on substrates of non-CYP metabolism enzymes are available from the literature, and could be extracted from several of the databases discussed here, including Accelrys Metabolite [201], MetaBase [203] or CTD [246].

Future perspective

It is customary at this point in a review paper on property prediction methods and tools (whether they are metabolism-related or dealing with other properties in the drug design and ADMET areas) to call for more public data and to hope for the sharing of the large proprietary datasets available in pharmaceutical companies, so here we will continue that tradition. The extent to which many of the software programs listed here are based on the same data (from the literature, from the Accelrys Metabolite database [201], from the relatively small set of existing drugs and patented pre-clinical compounds, and most recently from PubChem [233]) is striking. Of course, all developers of publicly available software, be it free, academic or commercial, face the same constraints on data availability. It may be, as observed in the analysis of bioactivity databases by Tiikkainen and Franke at Merz Pharma [98], that the literature data used by different vendors do not overlap all that much, due to different time frames being used for the data extraction and collection. However, in many cases, the data for training sets or dictionary development have been extracted from very old literature. Old data are not necessarily wrong, but are perhaps not fully representative of the new chemistry space being explored in current drug design and discovery work [12]. Older data are also certainly not consistent with newer methods of metabolite detection [99].

Mechanisms for publicly sharing the large proprietary ADMET datasets owned by pharmaceutical companies have been considered. One successful example of this is the nonprofit Lhasa, Ltd, developers of Meteor, which has set up a membership program for collaborative work on toxicity and metabolism predictions. Generally, the actual proprietary data points or measurements are not shared between Lhasa members, but the data are used to develop and validate rules for the toxicity and metabolism knowledge bases [100]. The potential for sharing ADMET models, if not data, was also explored in a collaboration between researchers at Pfizer and Collaborative Drug Discovery [101]. Certainly it is not surprising that high-quality QSAR models can be built with open-source descriptors such as those from the Chemistry Development Kit [269] and open-source model-building tools such as R [270] or KNIME [271]. The issue is whether descriptors can be chosen in such a way as to capture aspects of the chemical structure that contribute to its biological properties, without allowing the original structures of compounds in the training set to be reverse engineered [102,103].

In calling for new data and more data, it is also worthwhile to consider what kind of new data would be the most useful. Even the large datasets in big pharmaceutical companies may be narrowly focused on sets of closely related series of analogs, and perhaps not generalizable to other regions of chemical space. It is also not clear to what extent data from different laboratories or different assay protocols can reasonably be combined [104,105]. The screening centers of the NIH Molecular Libraries Program [272] have contributed to this effort by producing assays for several metabolism end points. Some commercial software currently contains QSAR models based on these data, and we expect to see more to come. It is to be hoped that these screening efforts will continue, with PubChem as a central repository, even in the face of funding cuts to the program [106].

Ultimately, in silico models are based on one of the fundamental tenets of pharmacology, that chemically similar molecules should behave in biologically similar ways. The presence of ‘activity cliffs’ in bioactivity SAR data is well known [107,108], but metabolic processes may be particularly susceptible to these effects, as the metabolic fate of a compound depends strongly on a large number of variable factors in the biological system into which it is introduced. For example, a small difference in distribution relative to the liver between two similar compounds could result in a very large difference in metabolic outcome [23]. In addition, there are many variations in metabolic and other ADME responses due to genetic differences between individuals, and within the same individual under different circumstances (e.g., pregnancy or aging) [109]. This suggests that there is a fundamental limit to the predictivity achievable by any structure-based metabolism modeling method, and that that limit may not be very high.

A productive avenue for increasing the usability of in silico modeling would be the improvement of algorithms capable of extracting what signal there is from noisy data, in an extremely rugged activity landscape, without overfitting. Along with this it would be important to have a consistent method for dealing with uncertainty, namely, how to determine what level of confidence to have in model predictions for new individual compounds, and where to draw the line between reasonable and nonsensical predictions.
Executive summary

■ This review focuses on publicly available databases and software that can be used for in silico prediction of metabolism-related properties.

■ Computational end points for metabolism predictions include regioselectivity, biotransformations, interactions with metabolizing enzymes and metabolic stability.

■ Programs for predicting biotransformations and regioselectivity are highly incestuous and based on many of the same datasets.

■ We tested the programs PASS, StarDrop, isoCYP, MetaDrug, ADMET Predictor, Percepta and VirtualToxLab for predicting interactions with metabolizing enzymes and found, qualitatively, a 60-75% success rate.

■ It is of concern that there is very little publicly available software for predicting phase II and other non-CYP interactions.

■ Future improvements in modeling metabolism-related properties could be realized with larger sets of consistent experimental data and careful consideration of the limits of model predictivity.